Sebastian Quirarte | sebastianquirajus@gmail.
The aim of this project is to extract, transform, and visualize data from any YouTube channel and all of its videos using the YouTube API, Python, and PowerBI. In this case, I wanted to analyze one of my favorite channels: Kurzgesagt – In a Nutshell.
Data Extraction and Transformation: Extracting data directly from YouTube through their API and storing this data into dataframes.
Data Preprocessing: Converting data types, creating new columns, and cleaning dataframes.
Visualization and Analysis: Visualizing and analyzing the data obtained from all the videos uploaded by the channel of interest.
Dashboard: Creating a dashboard in PowerBI of the data collected to display the project results in a concise manner.
Note: install missing libraries using pip install in the terminal if necessary.
# General Libraries
from googleapiclient.discovery import build
import pandas as pd
from IPython.display import JSON
from dateutil import parser
import isodate
# Functions written specifically for this project
from youtubeAPI import *
# Data visualization
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
# Natural Language Processing (NLP)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud
To start, we generate an API key from the Google API console following the steps mentioned at: https://developers.google.com/youtube/v3/getting-started
PLEASE CHANGE THIS KEY TO YOUR OWN PERSONAL KEY IF YOU WANT TO RUN THIS CODE
Note: The quota limit for the YouTube API is 10,000 units per day
api_key = 'AIzaSyAnQgjZ7IL_ZV1qfsjCAnu9t7Hmjn9KE1c' # CHANGE KEY
We'll add the channel ID of the channel we want to analyze. This can be obtained directly from the YouTube channel page URL or from online tools found through a simple Google search.
channel_id = ['UCsXVk37bltHxD1rDPwtNM8Q'] # Kurzgesagt – In a Nutshell
Let's set up our YouTube API.
api_service_name = "youtube"
api_version = "v3"
# Get credentials and create an API client
youtube = build(
api_service_name, api_version, developerKey=api_key)
We can request the data of the channels we are interested in and view the results in JSON format.
request = youtube.channels().list(
part="snippet,contentDetails,statistics",
id=channel_id
)
response = request.execute()
JSON(response)
After taking a look at the structure of the JSON data, we'll create a function that requests data and stores it as a dictionary inside a list and returns it as a dataframe.
The statistics we're going to extract are: channel name, total subscribers, total views, total videos, and the uploads playlist ID.
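The actual implementation lives in youtubeAPI.py (imported earlier); a minimal sketch of what get_channel_stats might look like, assuming the JSON structure shown above:

```python
import pandas as pd

def get_channel_stats(youtube, channel_ids):
    """Request channel statistics and return them as a dataframe.

    Sketch of the helper imported from youtubeAPI.py; the real
    implementation may differ in details.
    """
    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=','.join(channel_ids)
    )
    response = request.execute()

    all_data = []
    for item in response.get('items', []):
        all_data.append({
            'Channel_Name': item['snippet']['title'],
            'Total_Subscribers': item['statistics']['subscriberCount'],
            'Total_Views': item['statistics']['viewCount'],
            'Total_Videos': item['statistics']['videoCount'],
            # The uploads playlist ID is used later to list every video
            'PlaylistID': item['contentDetails']['relatedPlaylists']['uploads'],
        })
    return pd.DataFrame(all_data)
```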
By calling our function we can extract the data for each of the YouTube channels and view them in a dataframe.
channel_stats = get_channel_stats(youtube, channel_id)
channel_stats
| | Channel_Name | Total_Subscribers | Total_Views | Total_Videos | PlaylistID |
|---|---|---|---|---|---|
| 0 | Kurzgesagt – In a Nutshell | 21500000 | 2527532973 | 198 | UUsXVk37bltHxD1rDPwtNM8Q |
Next, let's create a second function that loops through the channel's uploads playlist and returns the IDs of all of its videos.
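Again, the real helper is imported from youtubeAPI.py; a sketch of how get_video_ids might be written, assuming the standard playlistItems endpoint (which returns at most 50 items per call, so we page through with nextPageToken):

```python
def get_video_ids(youtube, playlist_id):
    """Return the IDs of every video in a playlist.

    Sketch of the helper imported from youtubeAPI.py.
    """
    video_ids = []
    next_page_token = None
    while True:
        request = youtube.playlistItems().list(
            part='contentDetails',
            playlistId=playlist_id,
            maxResults=50,
            pageToken=next_page_token
        )
        response = request.execute()
        for item in response.get('items', []):
            video_ids.append(item['contentDetails']['videoId'])
        # Keep requesting pages until the API stops returning a token
        next_page_token = response.get('nextPageToken')
        if next_page_token is None:
            break
    return video_ids
```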
Calling our get_video_ids function returns a list of all video IDs for our specified playlist.
# Gets playlistID from channel_stats dataframe
playlist_id = channel_stats.iloc[0]["PlaylistID"]
# Gets all video IDs from YouTube channel
video_ids = get_video_ids(youtube, playlist_id)
Now that we have all the video IDs from a specified channel, we can request the data from each video using the following function.
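A minimal sketch of what get_video_details (also imported from youtubeAPI.py) might look like, assuming the videos endpoint, which accepts at most 50 IDs per call, so requests are batched:

```python
import pandas as pd

def get_video_details(youtube, video_ids):
    """Request per-video statistics and return them as a dataframe.

    Sketch of the helper imported from youtubeAPI.py.
    """
    stats_to_keep = {
        'snippet': ['channelTitle', 'title', 'description', 'tags', 'publishedAt'],
        'statistics': ['viewCount', 'likeCount', 'commentCount'],
        'contentDetails': ['duration', 'definition', 'caption'],
    }
    all_video_info = []
    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
            part='snippet,contentDetails,statistics',
            id=','.join(video_ids[i:i + 50])
        )
        response = request.execute()
        for video in response.get('items', []):
            video_info = {'video_id': video['id']}
            for part, fields in stats_to_keep.items():
                for field in fields:
                    # Missing fields (e.g. tags) become None
                    video_info[field] = video[part].get(field)
            all_video_info.append(video_info)
    return pd.DataFrame(all_video_info)
```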
video_df = get_video_details(youtube, video_ids)
video_df
| | video_id | channelTitle | title | description | tags | publishedAt | viewCount | likeCount | commentCount | duration | definition | caption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fuFlMtZmvY0 | Kurzgesagt – In a Nutshell | The Internet is Worse Than Ever – Now What? | Go to https://ground.news/nutshell to get 30% ... | None | 2023-11-29T15:00:43Z | 2871336 | 204225 | 16523 | PT11M32S | hd | true |
| 1 | FgnjdW-x7mQ | Kurzgesagt – In a Nutshell | The Last Thing To Ever Happen In The Universe | Get 50% off your first month of KiwiCo at http... | None | 2023-11-21T15:00:01Z | 3639408 | 198112 | 9440 | PT11M5S | hd | true |
| 2 | tybKnGZRwcU | Kurzgesagt – In a Nutshell | How to Win an Interstellar War | ✨ Join the Galactic Club: your cozy corner in ... | None | 2023-11-15T15:00:01Z | 4489106 | 212122 | 10615 | PT12M43S | hd | true |
| 3 | S7TUe5w6RHo | Kurzgesagt – In a Nutshell | 4.5 Billion Years in 1 Hour | Learn more about how complex life evolved with... | None | 2023-11-10T16:15:06Z | 5986432 | 430706 | 22717 | PT1H3M31S | hd | true |
| 4 | Qtmrp_4BKRQ | Kurzgesagt – In a Nutshell | A Glimpse Into Our Future | ✨ WORLDWIDE SHIPPING AVAILABLE ✨ The 12,024 Hu... | None | 2023-10-17T14:04:14Z | 1481809 | 53767 | 1516 | PT1M42S | hd | true |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 193 | F3QpgXBtDeo | Kurzgesagt – In a Nutshell | How The Stock Exchange Works (For Dummies) | Why are there stocks at all?\n\nEveryday in th... | [Stock market, wallstreet, stock exchange, mar... | 2013-11-28T17:03:32Z | 8297195 | 130534 | 7270 | PT3M34S | hd | true |
| 194 | UuGrBhK2c7U | Kurzgesagt – In a Nutshell | The Gulf Stream Explained | Learn about the role of the sea in global warm... | [global conveyor belt, climate change, global ... | 2013-10-11T19:11:39Z | 5938614 | 64868 | 1984 | PT5M4S | hd | true |
| 195 | Uti2niW2BRA | Kurzgesagt – In a Nutshell | Fracking explained: opportunity or danger | Fracking explained in five minutes.\n\nFrackin... | [Fracking, controversy, oil, gas, nature, wate... | 2013-09-03T09:12:24Z | 7254944 | 101461 | 8073 | PT5M3S | hd | true |
| 196 | KsF_hdjWJjo | Kurzgesagt – In a Nutshell | The Solar System -- our home in space | An Infographic trip through the wonders of the... | [Solar system, planets, earth, mars, moon, jup... | 2013-08-22T13:24:56Z | 6231652 | 83643 | 6167 | PT7M21S | hd | true |
| 197 | hOfRN0KihOU | Kurzgesagt – In a Nutshell | How Evolution works | The mechanisms of evolution explained in one v... | [evolution, science, darwin, design, infograph... | 2013-07-11T14:09:52Z | 10605269 | 244000 | 75674 | PT11M48S | hd | true |
198 rows × 12 columns
This is our main dataframe we will visualize and analyze, but first we need to pre-process our data.
Let's take a look at the null values in our data.
# Check for NULL values
video_df.isnull().any()
video_id        False
channelTitle    False
title           False
description     False
tags             True
publishedAt     False
viewCount       False
likeCount       False
commentCount    False
duration        False
definition      False
caption         False
dtype: bool
There are null values only in our tags column, meaning that some of this channel's videos don't have any tags. Let's check the data types of the columns of our dataframe.
# Check data types
video_df.dtypes
video_id        object
channelTitle    object
title           object
description     object
tags            object
publishedAt     object
viewCount       object
likeCount       object
commentCount    object
duration        object
definition      object
caption         object
dtype: object
The 'object' dtype doesn't make sense for some of our columns, such as view count. Let's fix that by converting the necessary columns to numeric format.
# Convert count columns to numeric
numeric_cols = ['viewCount', 'likeCount', 'commentCount']
video_df[numeric_cols] = video_df[numeric_cols].apply(pd.to_numeric, errors='coerce')
Now, let's determine the day of the week that each video was published and add these days as a new column.
# Publish day of the week
video_df['publishedAt'] = video_df['publishedAt'].apply(lambda x: parser.parse(x))
video_df['pushblishDayName'] = video_df['publishedAt'].apply(lambda x: x.strftime("%A"))
We'll also convert the YouTube time format to seconds as a new column. For this we can use the isodate library.
# Convert YouTube duration to seconds
video_df['durationSecs'] = video_df['duration'].apply(lambda x: isodate.parse_duration(x))
video_df['durationSecs'] = video_df['durationSecs'].apply(lambda x: x.total_seconds())
Finally, let's add a column to our dataframe that shows the total number of tags that each video has (NULL values will be shown as 0).
# Add tag count
video_df['tagCount'] = video_df['tags'].apply(lambda x: 0 if x is None else len(x))
Here's our final dataframe.
video_df
| | video_id | channelTitle | title | description | tags | publishedAt | viewCount | likeCount | commentCount | duration | definition | caption | pushblishDayName | durationSecs | tagCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | fuFlMtZmvY0 | Kurzgesagt – In a Nutshell | The Internet is Worse Than Ever – Now What? | Go to https://ground.news/nutshell to get 30% ... | None | 2023-11-29 15:00:43+00:00 | 2871336 | 204225 | 16523 | PT11M32S | hd | true | Wednesday | 692.0 | 0 |
| 1 | FgnjdW-x7mQ | Kurzgesagt – In a Nutshell | The Last Thing To Ever Happen In The Universe | Get 50% off your first month of KiwiCo at http... | None | 2023-11-21 15:00:01+00:00 | 3639408 | 198112 | 9440 | PT11M5S | hd | true | Tuesday | 665.0 | 0 |
| 2 | tybKnGZRwcU | Kurzgesagt – In a Nutshell | How to Win an Interstellar War | ✨ Join the Galactic Club: your cozy corner in ... | None | 2023-11-15 15:00:01+00:00 | 4489106 | 212122 | 10615 | PT12M43S | hd | true | Wednesday | 763.0 | 0 |
| 3 | S7TUe5w6RHo | Kurzgesagt – In a Nutshell | 4.5 Billion Years in 1 Hour | Learn more about how complex life evolved with... | None | 2023-11-10 16:15:06+00:00 | 5986432 | 430706 | 22717 | PT1H3M31S | hd | true | Friday | 3811.0 | 0 |
| 4 | Qtmrp_4BKRQ | Kurzgesagt – In a Nutshell | A Glimpse Into Our Future | ✨ WORLDWIDE SHIPPING AVAILABLE ✨ The 12,024 Hu... | None | 2023-10-17 14:04:14+00:00 | 1481809 | 53767 | 1516 | PT1M42S | hd | true | Tuesday | 102.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 193 | F3QpgXBtDeo | Kurzgesagt – In a Nutshell | How The Stock Exchange Works (For Dummies) | Why are there stocks at all?\n\nEveryday in th... | [Stock market, wallstreet, stock exchange, mar... | 2013-11-28 17:03:32+00:00 | 8297195 | 130534 | 7270 | PT3M34S | hd | true | Thursday | 214.0 | 21 |
| 194 | UuGrBhK2c7U | Kurzgesagt – In a Nutshell | The Gulf Stream Explained | Learn about the role of the sea in global warm... | [global conveyor belt, climate change, global ... | 2013-10-11 19:11:39+00:00 | 5938614 | 64868 | 1984 | PT5M4S | hd | true | Friday | 304.0 | 19 |
| 195 | Uti2niW2BRA | Kurzgesagt – In a Nutshell | Fracking explained: opportunity or danger | Fracking explained in five minutes.\n\nFrackin... | [Fracking, controversy, oil, gas, nature, wate... | 2013-09-03 09:12:24+00:00 | 7254944 | 101461 | 8073 | PT5M3S | hd | true | Tuesday | 303.0 | 28 |
| 196 | KsF_hdjWJjo | Kurzgesagt – In a Nutshell | The Solar System -- our home in space | An Infographic trip through the wonders of the... | [Solar system, planets, earth, mars, moon, jup... | 2013-08-22 13:24:56+00:00 | 6231652 | 83643 | 6167 | PT7M21S | hd | true | Thursday | 441.0 | 28 |
| 197 | hOfRN0KihOU | Kurzgesagt – In a Nutshell | How Evolution works | The mechanisms of evolution explained in one v... | [evolution, science, darwin, design, infograph... | 2013-07-11 14:09:52+00:00 | 10605269 | 244000 | 75674 | PT11M48S | hd | true | Thursday | 708.0 | 17 |
198 rows × 15 columns
We can export this dataframe as its own CSV file for ease of reuse in Jupyter Lab.
# Exports dataframe as csv file
video_df.to_csv('video_df.csv')
Let's also export the channel stats as a CSV file.
# Exports dataframe as csv file
channel_stats.to_csv('channel_df.csv')
Let's create a bar plot of the channel's 10 most-viewed videos.
# Format plot size for larger plots
plt.rcParams['figure.figsize'] = [10, 10]
# Creates barplot and sets fundamental parameters
ax = sns.barplot(
x = 'viewCount',
y = 'title',
data = video_df.sort_values('viewCount', ascending=False)[0:10],
palette="crest",
hue="title"
)
# Divides x-axis values by 1,000,000 to convert views to millions
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos:'{:,.0f}'.format(x/1000000) + ' M'))
# Labels
plt.xlabel('Views', fontsize='30', labelpad=20)
plt.ylabel('')
# Title
plt.title('Most Viewed Videos', fontsize='50', pad=30)
# y-axis values size
plt.rc('ytick', labelsize=20)
# x-axis values size
plt.rc('xtick', labelsize=20)
Let's also plot the 10 least-viewed videos.
# Creates barplot and sets fundamental parameters
ax = sns.barplot(
x = 'viewCount',
y = 'title',
data = video_df.sort_values('viewCount')[0:10],
palette = 'crest_r',
hue = 'title'
)
# Divides x-axis values by 1,000 to convert views to thousands
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos:'{:,.0f}'.format(x/1000) + 'K'))
# Labels
plt.xlabel('Views', fontsize='30', labelpad=20)
plt.ylabel('')
# Title
plt.title('Least Viewed Videos', fontsize='50', pad=30)
# y-axis values size
plt.rc('ytick', labelsize=20)
# x-axis values size
plt.rc('xtick', labelsize=20)
We can see that 60% (6/10) of the least-viewed videos are YouTube Shorts; note that this analysis places full-length videos and Shorts in the same category.
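If we wanted to treat Shorts separately, the durationSecs column gives a simple heuristic: Shorts are capped at roughly 60 seconds. A hypothetical example (the dataframe below is a stand-in for video_df; the 60-second threshold is an assumption, not an official flag from the API):

```python
import pandas as pd

# Stand-in for video_df, with the durationSecs column created earlier
df = pd.DataFrame({
    'title': ['Short clip', 'Full video'],
    'durationSecs': [45.0, 692.0],
})

# Heuristic: treat anything up to 60 seconds as a YouTube Short
is_short = df['durationSecs'] <= 60
shorts_df = df[is_short]
full_length_df = df[~is_short]
```

Analyses like the least-viewed-videos plot could then be run on full_length_df alone.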
A violin plot can let us visualize the distribution of the views of all the channel's videos.
# Creates violin plot and sets x and y values
ax = sns.violinplot(
x = video_df['viewCount'],
color = 'mediumseagreen',
cut = 0
)
# Set transparency (alpha)
plt.setp(ax.collections, alpha=.5)
# Divides x-axis values by 1,000,000 to convert views to millions
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos:'{:,.0f}'.format(x/1000000) + ' M'))
# Labels
plt.xlabel('Views', fontsize='30', labelpad=20)
plt.ylabel('')
# Title
plt.title('Views Across Videos', fontsize='40', pad=30)
# y-axis values size
plt.rc('ytick', labelsize=20)
Let's also calculate the mean, median, maximum, and minimum view counts.
# Creates dataframe with views values
views_data = [{"Mean" : format(video_df['viewCount'].mean()/1000000,".2f"),
"Median" : format(video_df['viewCount'].median() / 1000000,".2f"),
"Max" : format(video_df['viewCount'].max() / 1000000,".2f"),
"Min" : format(video_df['viewCount'].min() / 1000000,".2f")
}]
pd.DataFrame(views_data).style.set_caption("Views in Millions")
| | Mean | Median | Max | Min |
|---|---|---|---|---|
| 0 | 12.77 | 10.43 | 88.46 | 0.89 |
Now we'll visualize the relationship between views and likes as well as comments per video.
# Creates scatter plot and sets x and y values, as well as hue and size parameters
ax = sns.scatterplot(
data = video_df,
x = 'viewCount',
y = 'likeCount',
hue = 'commentCount',
size = 'commentCount',
sizes = (10, 550),
palette = 'crest'
)
# Divides x-axis values by 1,000,000 to convert views to millions
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos:'{:,.0f}'.format(x/1000000) + ' M'))
# Divides y-axis values by 1,000 to convert likes to thousands
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda y, pos:'{:,.0f}'.format(y/1000) + ' K'))
# Title
plt.title('Views, Likes, and Comments', fontsize='40', pad=30)
# Labels
plt.xlabel('Views', fontsize='25', labelpad=20)
plt.ylabel('Likes', fontsize='25', labelpad=20)
# Legend
plt.legend(title='Comments', fontsize='15', title_fontsize='20')
From this scatter plot we can see a positive correlation between a video's views and its numbers of likes and comments. This makes sense, as videos with more views have more viewers who can interact with them.
Two videos seem to have some interesting data:
Next, let's create a histogram of the duration of all videos.
# Creates histogram and sets x values, as well as parameters
sns.histplot(
data = video_df,
x = 'durationSecs',
bins = 30,
color = 'mediumseagreen',
alpha = 0.5
)
# Labels
plt.xlabel('Duration (s)', fontsize='25', labelpad=20)
plt.ylabel('Count', fontsize='25', labelpad=20)
# Title
plt.title('Video Duration', fontsize='40', pad=30)
Let's implement some natural language processing (NLP) by creating a word cloud of the most common words in all video titles.
# Download the stopwords corpus if it's not already available
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
video_df['title_no_stopwords'] = video_df['title'].apply(lambda x: [item for item in str(x).split() if item not in stop_words])
all_words = list([a for b in video_df['title_no_stopwords'].tolist() for a in b])
all_words_str = ' '.join(all_words)
wordcloud = WordCloud(width = 2000, height = 1000, random_state=1, background_color='white',
colormap='crest', collocations=False).generate(all_words_str)
plot_cloud(wordcloud)
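plot_cloud is another helper imported from youtubeAPI.py; a minimal sketch of what it might look like (the figure size is an assumption):

```python
import matplotlib.pyplot as plt

def plot_cloud(wordcloud):
    """Display a generated WordCloud image.

    Sketch of the helper imported from youtubeAPI.py; the real
    implementation may differ.
    """
    plt.figure(figsize=(20, 10))
    # WordCloud objects are accepted by imshow like any image array
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()
```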
Lastly, we'll create a dataframe and bar chart that visualize the channel's upload schedule across days of the week.
# Creates dataframe with the number of videos uploaded per day
day_df = pd.DataFrame(video_df['pushblishDayName'].value_counts())
# Orders days of the week manually
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Redefines the dataframe index based on days of the week
day_df = day_df.reindex(weekdays)
day_df
| | count |
|---|---|
| pushblishDayName | |
| Monday | 7.0 |
| Tuesday | 59.0 |
| Wednesday | 18.0 |
| Thursday | 56.0 |
| Friday | 6.0 |
| Saturday | NaN |
| Sunday | 52.0 |
# Creates barplot and sets y values as well as color
ax = day_df.plot.bar(
y = 'count',
color = "mediumseagreen",
alpha = 0.5,
edgecolor = 'black'
)
# Removes legend
ax.get_legend().remove()
# Labels
plt.xlabel('')
plt.ylabel('Count', fontsize='25', labelpad=20)
# Title
plt.title('Uploading Schedule', fontsize='40', pad=30)
From this bar chart it's clear that this channel uploads most of its videos on Tuesday, Thursday, and Sunday, with no videos uploaded on a Saturday.
The dataframes were then imported into PowerBI to create this dashboard.
Some possible improvements to this project could be:
This project is based on 'YouTube API for Python: How to Create a Unique Data Portfolio Project' by Thu Vu data analytics on YouTube. Source: https://youtu.be/D56_Cx36oGY
Nov 29 2023 -- Version 1.0